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Abstract 

The answers to data assimilation questions can be expressed as path integrals 
over all possible state and parameter histories. We show how these path integrals 
can be evaluated numerically using a Markov Chain Monte Carlo method de- 
signed to run in parallel on a Graphics Processing Unit (GPU). We demonstrate 
the application of the method to an example with a transmembrane voltage time 
series of a simulated neuron as an input, and using a Hodgkin-Huxley neuron 
model. By taking advantage of GPU computing, we gain a parallel speedup 
factor of up to about 300, compared to an equivalent serial computation on a 
CPU, with performance increasing as the length of the observation time used 
for data assimilation increases. 

Keywords: Data assimilation, State and parameter estimation, GPU 
computing. Path integral Monte Carlo, Hodgkin-Huxley 



1. Introduction 

Data assimilation is designed to utilize information from measurements of 
an observed physical or biological system to complete a model of that system. 
Completion means accurately estimating unknown parameters in the model and 
accurately estimating the unobserved model state variables at the end, time T, 
of the observation window. The information on the parameters p and the full 
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set of model state variables xt allow forecasting the state of the system using 
the model. Also knowledge of the state xt permits assessing the quality of the 
model. 

Typically the model will take the form of a system of nonlinear differen- 
tial equations, either based on the underlying physics of the system in ques- 
tion, or designed to reproduce observed phenomena. To make quantitative 
predictions the parameters of the model need to be adjusted to fit the data. 
Since neither the model nor the measurements are exact, the problem should 
be treated probabilistically. The knowledge we gain from the observations al- 
lows us to estimate the conditional probability distribution P(xT|yi:M), where 
Yi-.M = {yi,y2, • ■ • jYm} is the collection of observations up to time Im = T. 
Moments and marginal distributions of parameters or state variables may also 
be estimated. 

Within an observation window, one makes measurements at time t„ = 
{ti,t2, ■ ■ ■ ,tM — T}. At each time L observations are made, represented 
by the L-dimensional vector y{tn) = Yn- The observations provide information 
about the system state, but it is not complete information, because the obser- 
vations are likely to be noisy, and more importantly, the dimension i of y is 
typically less than the dimension of the system state Xn,a] a — 1,2, . . . , D > L. 

The problem of data assimilation was formulated in a probabilistic way 
some time ago [ll, The key question is this: given a time series of obser- 
vations Y = yi:j\/ = {yi, y2, Ym} and a model, what model state histories 
X = {xq, xi, . . . , Xjvf} and set of parameters p = {pi,p2, ■.■,PNj,} could have 
produced the observed data? The conditional distribution over state histories 
and conditioned on observations is called P(X|Y). 

Various methods have been developed to solve this problem. Kalman filter- 
ing is the exact solution when the model is linear and the noise is Gaussian 
and white in time. A more general and computationally intensive approach is 
particle filtering 0, @ , where the distribution over states is approximated by 
an ensemble of particles. A similar approach is to phrase the problem in terms of 
continuous time path integrals 0, S Q • The continuous time path integrals are 
approximations, and in discrete time an exact formulation has been given [loj . 
In the path integral approach, the whole time history of states X is considered 
at once, instead of sequentially calculating the states at discrete time steps, x„. 

What we really want are moments of functions on the path X or histograms 
of the marginal distributions, P(a;„,i|Y), where Xn,i is one component of X. 
These can be expressed as path integrals over all possible state histories X, with 
each path weighted by P(X|Y). The path integrals can then be approximated 
using standard Markov Chain Monte Carlo techniques. The main difficulty with 
this approach is that the methods typically require very many iterations to get 
good results, and so they can be very computationally expensive. 

This problem can be ameliorated by taking advantage of parallel computing 
technology using a graphics processing unit (GPU). These computations can 
be done on inexpensive desktop computers, with an off-the-shelf GPU, which 
can execute hundreds of threads in parallel (see [ll| for another use of GPU 
computing in physics) . The Markov chain still must be updated sequentially of 
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course, but the computations done on each iteration can be highly parallehzed. 
Since in the path integral approach, all time steps are considered at once, it is 
possible to parallelize the algorithm such that each thread does the computa- 
tions associated with a different time step. This leads to a large decrease in the 
amount of computation time required, and makes many problems become prac- 
tical. We show an example problem using simulated neuron voltage time series 
data, and the Hodgkin-Huxley neuronal model as inputs into the procedure. In 
this example we get a parallel speedup factor of up to about 300. 



2. Path Integral Formulation of Data Assimilation 

This section is review of previous work 

fliMHEl and an introduction 
of notation. The idea is to express the answer to probabilistic data assimilation 
questions as path integrals over all possible state histories, and all possible 
parameter values. 

The inputs into this formulation will be a model of the observed system and a 
time series of observations, together with uncertainties associated with these two 
ingredients. The model will be in the form of Markov transition probabilities, 
which describe how the model state evolves in time. The transition probabilities 
may come from a discrete time map or a system of ordinary differential equations 
which is then discretized. In this section we will show how these ingredients are 
combined to form a probability distribution that is a function of the time history 
of the model state conditioned on the observations. 

The first ingredient is the model. To make a model, we first represent the 
state of the system at each time step n g {0, 1, . . . , Af } as a D-dimensional vector 
x„. We assume the dynamical model is Markov, and so we can represent the time 
evolution using a transition probability P(x„_|_i |x„) which depends only on Xn+i 
and x„, and not any previous states. This tells us the probability of transitioning 
from state x„ at time step n to state x„_|_i at time step n+1. If there is no noise 
in the model, this is a deterministic transition in which case P(x„_|_i|x„) will be 
a delta function. When noise is present, typically due to a noisy environment 
or arising from resolution errors associated with discretization of the model 
in time and space, this will be represented as stochastic transition in which 
case P(x„+i|x„) will a broader version of a delta function, with a width which 
depends on the noise level. The D dimensions of the model may include any 
time-independent parameters, by promoting the parameters to state variables 
with time evolution given by p = 0. 

The second ingredient is a time series of measurements, Y = yi-.M = 
{yi- y2, • • • , Ym}- At each time step labeled by n e {1, 2, . . . , Af} the measure- 
ment is an L-dimensional vector y„. Part of the model development consists of 
associating a known 'observation' function h(x„) of the model state variables 
x„ with the observation y„. Usually one assumes that the observations and 
the observation function are related by additive noise added to the observation, 
so y„ « h(x„) -|- noise, but we first discuss a more general case. Typically 
L < D, so there are 'hidden states' which must be inferred using all the infor- 
mation available from observations and from the model. More keeping with a 
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physicist's view of measurements, we refer to the D ~ L unmeasured states as 
unobserved state variables. We need to estimate them as weh as the fixed pa- 
rameters. We wiU assume later that the observation only depends on the state 
at the current time step. 

We then consider the discrete time evolution of the model state by defin- 
ing a path variable X = Xo:Af = {xo,Xi, . . . ,XAf}. This describes a trajectory 
through the the £)-dimensional state space. This formulation considers the 
entire path simultaneously, assuming all the observations have already been col- 
lected. In signal processing language this method would be called a 'smoother', 
as opposed to a 'filter' which only uses observations from the past. 

We then use the Markov transition probabilities, P(x„ |x„_i) specified by the 



model to find P(x„|yi:„_i) from P(x„_i|yi:„_i). We also use Bayes' rule [14| 
to introduce the observations into the formulation. The result is 



-P(x„|yi:„) = -^(^"1^"'^!^" /"dx„_iP(x„|x„_i)P(x„_i|yi:„_i), 

"(y«|yi:n-i) J 

-P(yn7 ^n|yi:n— l) f j j-,/ i i \ 

= ^7 — ^ 7^7 — ^ 7 / dx„_iP(x„|x„_i)P(x„_i|yi:„_i) 

P(x„|yi:„-i)P(y„|yi:„-i) J 

- eCMI(x„.y„|y,„_i) j dx„_ i P(x„ |x„_i )P(x„_ i |yi:„_ i ) , (1) 

where we identify the factor in front of the integral as the exponential of the 
conditional mutual information between the state x„ at time tn and the mea- 
surement y„ at that time. 

This is the key equation that is used by all recursive Bayesian estimation 
methods. It is the starting point for particle filter methods [5| which use this 
equation to perform the forecast step to advance the probability distribution, 
as approximated by a weighted particle distribution, and the analysis step to 
incorporate measurements. Particle filtering methods process the measurements 
and update the model state sequentially. Here we take a different approach and 
consider the entire model evolution and the whole time series of observations 
simultaneously. 

To write P(xM|yi:M) over a path in state space we first look at the initial 
time step by setting n = 1 into Eq. ([T|) to arrive at 

^(Xl|yi:l) = ^^^ir^ I dXoP(Xi|xo)P(Xo), 

^(yi) J 

and we note that yi:o means there are no measurements to condition the proba- 
bilities by. We then apply Eq. ([T]) itcratively M — 1 additional times to establish 

P(XMlyi:Af) - / n rfx„_i%^^^pi^^i^P(x„|x„_i)^(X0)- (2) 
J r7=l -P(yn|yi:n-l) 

This is the marginal distribution of the state at the last time point xm- 
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This is also the integral representation of the solution to an underlying 
Fokker-Planck equation for the problem of a state x„ satisfying noisy dynamical 
equations being informed by noisy measurements y„. When the noise is Gaus- 



sian the underlying Fokker-Planck equation is known as the Zakai equation [15 |. 
We can use the definition of a marginal distribution 

-P(xAf|yi:M) = y"(ixo . . .(iXA/-lP(xO:Af|yi:M), 

to express the conditional probability distribution as a function of the entire 
path Xo:A/ via 



P(xO:Adyi:Af) = P(X|Y) - II 



n l^n? yi:n— 1 ; 



-P(y«|yi:n-i) 



P(x„|x„_i)F(xo). (3) 



To calculate the marginal distribution at a particular time step < n < M, 
P(x„|yi.A/), we integrate over the state variables at all the other time steps. 
Note that this formulation for -P(x„|yi.A/) includes the information gained from 
the whole time series of measurements yi:A/ and not just the the past measure- 
ments yi:„. 

We define the action, A, by taking the log of Eq. ([3]) and dropping additive 
terms independent of X, 

M M 

-A(X|Y) =^logP(y„|x„,yi:„_i) + ^logP(x„|x„_i)+logP(xo). (4) 

n— 1 n— 1 

The first term is where the information from the measurements is incorporated, 
the second term is where the dynamical model is incorporated, and the last 
term is the prior distribution of initial states and parameters, which takes into 
account any previous assumptions. Using this definition, we can write 

P(X|Y) oce-^(^IY). 

The constant of proportionality is independent of X, so the expectation of any 
function G'(X) on the path X, conditioned on the measurements Y, is written 
as 

/dXG(X)e-^(^|Y) 



£;[G(X)|Y] = (G(X)) = rJ.-Viv) ■ (5) 



2.1. Assumptions leading to simplified form of action 

We assume that the measurement at time step n depends in a known and 
deterministic way given by h on the current state, x„, plus multivariate Gaussian 
noise. We also assume that y„ is independent of any earlier measurement, and 
so 

y„ -h(x„) =AA(0,Ro"') 

= (6) 
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where A/'(0, Ro~^) represents a random vector drawn from a multivariate Gaus- 
sian with zero mean and covariance matrix Ro~^. We these assumptions, the 
term inside the first sum of Eq. ^ becomes 



P(y„|x„,yi:„_i) ^ P(y„|x„) oc exp 



Next we make some assumptions about the dynamics. We define the model 
error term by by discretizing the model differential equation, ^ = F(x(t)), 
using the trapezoid rule with time step At: 



At 



[F(x„)+F(x„_i)] 



(7) 



If the dynamics were deterministic, then e„ = (exact in the limit where 
At — > 0) for all n. Wc allow noise into the dynamics by relaxing this condition 
by replacing the zero with a stochastic term: 

e„ =AA(0,Rd"'). 

We have allowed for some types of random error in the model by introducing 
a noise term, which is an additive Gaussian noise applied at every time step. 
The noise is parameterized by the covariance matrix Rd~^ = AtT, where F is 
a diffusion matrix. We these assumptions, the term inside the second sum of 
Eq. dl]) becomes 

1 



P(x„|x„_i) oc exp 



Rri 



Finally, for simplicity we assume that P(xo) is uniform inside the boundaries 
and zero outside. This means the initial states and parameters can be restricted 
to fall inside a boundary, and that the last term in Eq. (|4]) can be dropped 
because it is a constant inside the boundary. 

The action can be written as A = Ao + Ad, where 



1 



■ Rri 



are the contributions from the observations and the dynamics respectively. 



3. Monte Carlo Evaluation 

Now that we have a formulation of P(X|Y), we would like to calculate 
quantities such as means and covariances, which can be used to make estimates 
and predictions. These quantities can be written as path integrals of the form 
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given in Eq. ([SJ, where G(X) is chosen to be some function of the path that is 
of interest. 

The challenge is to evaluate these path integrals. One way to do this is to 
generate a series of paths {X'^-'^\...,X('^^} that are distributed in path space 
according to P(X|Y) oc exp[— A(X| Y)], then we can use those paths to approx- 
imate the distribution with 



P(X|Y) « i^5(X-X(^'] 



We can then calculate expectation values of any function of the path with 



(G(X)) ^ J dX G(X)P(X|Y) 



lX]G(X«). (8) 



The sample paths can be thought of as representing many possible time evo- 
lutions of the system state that could have produced the observed data when 
measured. The paths that are more likely to have produced the observed data 
will be generated more times than the paths which are less likely to have pro- 
duced the observed data. 

Now we need a method which will produce a series of paths that are dis- 
tributed according to exp[— A(X)]. There are several path integral Monte 
Carlo methods, such as Metropolis or Hybrid Monte Carlo, that do exactly 
this [ll,[ll[l7|. 



3.1. Metropolis Monte Carlo 

One of the simplest and oldest approaches is the Metropolis Monte Carlo 
method [li'] . This method works by generating a sequence of paths {X*^^^ , • • ■ , X*^'^) } 
that we will call Monte Carlo paths by a random walk through path space. The 
method is an example of a Markov Chain Monte Carlo method, because the 
paths are generated in sequence and the next path is generated only from the 
current path in a stochastic way. The random walk is biased in a particular way 
as described below so that the sequence of paths that are generated come out 
distributed according to exp[— A(X)]. 

The method works by generating a new path X'"^^^ from the current path 
X^"^ with a two step procedure. First, a candidate path X' is proposed by 
adding an unbiased random displacement to the current path X = X*^"-*. The 
displacement may be to only one component or all the components, and may 
be drawn from any type of distribution, as long as it is unbiased; this assures 
that X — )■ X' is as likely to occur as X' — >■ X. 

The proposal distribution actually does not need to be unbiased if the ac- 
ceptance probability is modified in the way shown in [isj . There are several 
methods which make a smarter choice of the proposal distribution, such as the 
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force bias method [l9|. Such methods may be converge with fewer iterations, 
but they have the added complexity of requiring derivatives of the action to be 
computed. 

Next the proposed path is either accepted (X("+^) — X') or rejected (X("+^) = 
X). The probabihty for acceptance is 

^accept 

(X',X) = min(l,exp[-AA(X',X)]), (9) 

where Ayl(X',X) = A(X') — v4(X) is the change in action which would be 
caused by changing X to X'. This says that if a proposed change will lower the 
action it should be accepted, and if the proposed change increases the action it 
should only be accepted with probability exp[— AA(X', X)]. Note that only the 
change in action is required, so the full action never needs to be computed. This 
means we do not need to keep track of additive constants to the action, and can 
save a lot of computation time by only computing the terms in the action that 
will be changed by the update. 

4. Parallel Implementation for GPUs 

The Metropolis Monte Carlo method is simple and powerful, but it has the 
drawback of requiring very many path updates to get accurate statistics. Since 
so many path updates are required, the computation becomes very expensive 
in terms of computer time. One way to deal with this problem is to take ad- 
vantage of parallel computing technology, using a Graphics Processing Unit 
(GPU) . With GPU technology, and using Compute Unified Device Architecture 
(CUDA), it is possible to execute hundreds of threads of execution simultane- 
ously. Typically each thread will perform the same operations, but on different 
pieces of the data. Of course since the paths are updated sequentially, the path 
update process cannot be run in parallel. However, the many computations 
needed on each iteration can be done in parallel by having different threads 
work on different time steps. 

4-.1. The general procedure 

First the current path X is set to an initial guess, and the observation time 
series Y is loaded from a file. The path X includes the state vector x„ at every 
time step n = 0, 1, . . . , M , and the parameters. The states and parameters are 
treated differently, because the parameters are forced to be time-independent 
but the states may vary in time. The current path, the observation time series, 
the external drive signal (if present), and a running sum of moments of the path 
components are allocated in GPU memory and initialized with the appropriate 
data. 

Then the path update loop begins. First the even n states are updated, and 
then the odd n states (see. Fig. [T]and Fig. [2|). The reason for doing the state 
update in two steps is to uncouple the state vectors: to calculate the change in 
action due to perturbing x„, we need to know x„_i and x„_|_i, but none of the 
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1. For each odd n, a thread does the fohowing: 

(a) Proposed change: xjj = x„ + An • U(— 1, 1) 

(b) Calculate and store new RHS of model equations: FJ^ — F(xJ^) 

(c) Calculate model error terms: e„, ej^, e„+i, e'i^_i 

(d) Calculate observation error terms: dn,6'^ 

(e) Change in dynamical part of action: 

(f) Change in observation part of action: AAo^n = [(^^i)^ — ^n] 

(g) Total change in action: AAn = AAd.n + AAo^n 

(h) Acceptance probability: P^cc = niin(l, e"'^'^") 

(i) If U{0, 1) < Pace then accept the change: xj^ x„ and FJ^ — ;> F„ 

2. For each even n, a thread executes the same steps shown above 

3. Perturb parameter pi. Launch M + 1 threads to compute the resulting 
change in action, and then either accept or reject the change with the usual 
Metropolis rule. Repeat for each parameter, 1 = 1,2,..., Np in sequence. 



Figure 1: Pseudo-code for parallel state and parameter update. For simplicity the matrices 
Ro and are set be be Ro and ij^ times the identity matrix, respectively. U{a, b) is a 
random number between a and b drawn from a uniform distribution, and U is a vector of D 
such components. 



A 




n-1 n n+1 n+2 n+3 
Time step 



Figure 2: Schematic of the state update process. All even time steps are updated simultane- 
ously by different threads. The solid circles show the current states (for illustration they are 
shown as scalars, but generally are vectors). The open circles show the proposed new states, 
and the dotted lines represent the dynamical terms in the action which would change. After 
all even time steps have a chance to update, the same procedure is done to all the odd time 
steps. 
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other state vectors. This way each even n, and then each odd n can be updated 
independently, in any order or simultaneously. 

Once all the states have had a chance to be updated, then each parameter 
Pi is given a chance to change in sequence, I = 1,2, .... Np. One parameter, 
Pi, is perturbed, and then M + 1 threads are launched to calculate the change 
in action. Each thread is assigned to one n, and it calculates the new RHS of 
the model differential equation F'^ = F(x^). From this information the total 
change in action is calculated, and one decision about whether to accept the 
proposed change or not is made using the usual Metropolis rule. If the change 
is accepted, then F^ F„ for all n and pi. 

After all the states and parameters have had an opportunity to update, the 
current path can be used to update the path statistics. Typically this will be 
the mean and variance of each component of the path, but eovariances or higher 
moments could be of interest too. Another possibility is to record bin counts 
of some components of interest to make a histogram. This step is skipped 
for the first Ninu path updates, and after that only done every Ngkip-th path 
update. The statistics collection happens on the GPU, also in parallel, and so 
the individual paths are not recorded. This avoids costly data transfers between 
the GPU and CPU. 

4-2. More details 

There are several more details which will now be discussed. The first Ninu it- 
erations are the initialization phase, during which no path statistics are recorded. 
This phase is necessary to remove the influence of the initial guess path. Two 
other things happen during this phase: simulated annealing, and automatic 
adjustment of the MC step sizes Aj. 

The simulated annealing is done by putting a multiplier /3 < 1 in front 
of the dynamics term of the action, Ad, and gradually increasing /3 during 
the initialization phase, up to its final value of /3 = 1. This way the action 
will initially be dominated by the quadratic observation term, Ag, which is 
smooth and has a single well-defined minimum. Specifically, /? is initially set to 
/3 = /3o < 1 and multiplied by the constant factor f/s after each iteration, until 
13=1. The constant multiplying factor is 

= (l//3o)(^/^-'^ (10) 

The /3 plays a role similar to inverse temperature T in a system with a Boltz- 
mann distribution P(x) oc exp(— _B(x)/fcr). In our case the conditional path 
distribution is P(X|Y) oc exp(-^o(X|Y)) exp(-/3^d(X)). The two situations 
are not quite the same because in the second case there is a path-dependent and 
observation-dependent term, ^o(X|Y) which is not multiplied by /3. 

An automatic procedure for adjusting the the MC step sizes, A^, is useful 
because there are typically many different step sizes which would be hard to tune 
by hand. It is important to tune the sizes because if they are too small, then the 
proposed path changes will be very small. They will be likely to be accepted, 
but very many iterations will be required for the path to change significantly. 
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On the other hand, if the step size is too large, most proposed changes will be 
rejected, and in this case also, many iterations will be required for the path to 
change significantly. In this case there is a different step size for each component 
of the path indexed by i. The delta adjustment rule used was 

A, ^ A,: 



where Nacc.i is the number of accepted changes for component i over the past 
Nskip number of iterations, face is the target acceptance rate, and a is a con- 
stant that controls the adjustment rate. For a different approach which utilizes 
a proportional integral controller to tune step sizes see [20[ . This rule is applied 
every TV^fcip-th iteration, and only during the initialization phase. It is impor- 
tant to note that adjusting the step sizes will bias the distribution, so samples 
generated during the adjustment phase should not be used when calculating 
statistics. It is run in parallel on the GPU, with one thread assigned to each 



path component. It was shown in [2lJ, |22| that the optimal value for the ac- 



ceptance rate is 0.23 in the limit of a infinite dimensional multivariate normal 
target distribution, so that is the target acceptance rate that we use here (see 
also [l^)- It is important that each parameter and state variable are tuned 
independently, because the scales may be very different. 

Another useful thing to do is to set boundaries on parameters. This can be 
done by adding a large penalty to the action when a proposed change would 
move the parameter outside of the boundary. This makes the proposed change 
be rejected. This is one way of incorporating a prior distribution of parameters, 
in the simple case where the distribution is uniform with the bounds and zero 
outside of the bounds. 

The random number generation is done in parallel using the "Hybrid Taus- 
worthe" algorithm described in [IS]- Each thread is given its own initial seed, 
and so each thread will generate a different pseudo-random sequence. To test 
the quality of the random number generator we generated 512 different se- 
quences in parallel of 10,000 random numbers each, and then computed the 
cross-correlations and autocorrelations. All correlations came out between -0.04 
and 0.04, except the autocorrelations at zero time lag, which was one by choice 
of normalization. 



4-3. GPU- specific details 

There are several important limitations to keep in mind when designing 
CUDA code. The threads are grouped into blocks which run together. The 
amount of threads allowed in a block depends on how many resources each 
thread uses (often limited by shared memory usage or by number of registers 
used), but its maximum possible value is 1024 threads per block. The threads 
within a block can communicate with each other through shared memory, and 
can synchronize with each other. In contrast, threads in different blocks can 
have no interaction, and may be run in any order or simultaneously, so the code 
must be designed such that the order of block execution does not matter. 
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The GPUs have a large amount of global memory which is accessible to all 
threads, but has relatively high latency. The CPU can read or write to the 
global GPU memory, but this process should be kept to a minimum, because 
this process is relatively slow. Each block has a small amount of shared memory 
(49KB on the NVIDIA GTX 460 for example) which is much faster. The shared 
memory should be used where possible for temporary storage within the kernel 
call, but since its size is limited, there is often a trade off between using shared 
memory and being able to run more threads per block. 

A common approach, which was used in the state update kernel, is to load 
the necessary pieces of information from global memory into shared memory at 
the beginning of the kernel. In this case the data was the F„ values and the x„ 
values for the relevant time steps. There is a overlap of one time step on the 
data which is read (but not of the data which is changed) between each block 
to make the neighboring values accessible. Since the blocks cannot synchronize 
with each other, it is necessary to do the even and odd n values in two separate 
kernel invocations. This ensures that all the even n finish before any odd n 
start. 

The update process for the parameters is slightly more tricky, because in 
this case only one decision about whether to accept or reject is made, but it is 
based on the calculation of several blocks. This is handled by having each block 
calculate a change in action due to its assigned time steps. Each thread within 
the block calculates a change in action, and then participates in a parallel sum 
to get a total change in action for the whole block. Then after all the blocks 
finish another kernel is called which runs on a single thread and sums the block 
sums and decides whether to accept the change. If the change is accepted, 
the parameter is updated, and then the F„ values are updated. Actually two 
copies of the F„'s are kept in global memory, one which was calculated from 
the current path and one which was calculated from the proposed path. There 
is a pointer to select which copy is current, and this pointer is fiipped when the 
change is accepted, so that large memory transfers are avoided. 

The observation time series and external drive signal (if present) are loaded 
into constant memory, which is a small (65KB on NVIDIA GTX 460) memory 
which can be quickly read by the GPU but not written to. For longer time series, 
this memory may be too small, in which case global memory is used instead. 
Also the data is organized in memory so that threads with contiguous indices 
will access contiguous regions of memory. This allows the memory transfers 
to be coalesced and happen more efficiently. All calculations were done using 
single-precision floating point arithmetic. 



5. Results 

5.1. Example problem: Hodgkin- Huxley 

We now discuss the Hodgkin-Huxley neuronal model as an example problem. 



For other approaches to the similar problems see [25|, |26|. The model, first 



developed in 1952 based on experiments on the squid giant axon [27| . treats the 
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cell membrane as a capacitative layer with three types of conducting channels. 
The conductance of the sodium and potassium ion channels depends on the 
voltage across the membrane V(t). The differential equation for voltage is: 



= PlIsUm{t) 

+ P2mHt)hit){p3~Vit)) 
+ Pin\t) {P5-V{t)) 

+ PeiP7-V{t)). (11) 

The first term is an external stimulus current injected into the cell, the 
second term is the sodium ion current, the third term is potassium ion current, 
and the last term is a leak current. The voltage dependence of the conducting 
channels is modeled through gating probabilities ■m{t) , n{t) , h{t) , all between 
zero and one. The dynamics of the gating probabilities is specified by first- 
order kinetic equations with opening and closing rates which are functions of V. 
Equivalently, the equations can be expressed in terms of a steady state value 
UooiV) and a time constant Ta{V), where a stands for m, n, or h: 



da ^ Qoo(F) ~a{t) 

dt Ta{V) ■ ^ ' 

Phenomenological analytic expressions for these functions were discussed 
empirically by Hodgkin and Huxley. For this example we use different but 
qualitatively similar functions: 

a.oo{V) 

ra{V) 

Simulated data was created by integrating the Hodgkin-Huxley model, with 
a chaotic stimulus current generated from one of the variables of the Lorenz 
model Values for the parameters in the gating equation were chosen to 

qualitatively match what was found by Hodgkin and Huxley, and are given in 
Table [TJ Simulated measurements of transmembrane voltage V(t) were gener- 
ated from the model output with a sampling frequency of 25 kHz. This simulated 
data along with the chaotic stimulus current is shown in Fig. |3l 

The model has a total of four state variables, V,n,m, and h, but only the 
voltage is observed. This corresponds to the typical situation in real experi- 
ments. In a current clamp experiment on a single isolated neuron, a known 
current is injected into the cell while the transmembrane voltage response is 
recorded. In this version of the model, seven numbers pi^p2, ■ ■ ■ ,P7, were treated 
as unknown parameters, and all other constants were fixed at the same values 
for data generation and for the analysis. One can determine all of the constants 



1 

2 + 



tanh 



V-Vg 



Tai 1 — tanh 



V -Va 



(13) 
(14) 
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Figure 3: The simulated voltage measurement and the chaotic stimulus current. These two 
time series plus the Hodgkin- Huxley model (with seven unknown parameters) arc inputs into 
the PIMC procedure. The V{t) data is 8000 discrete points (although it is shown as a solid 
line) sampled at 25 kHz. 



in aooiV) and Tsa{V) with the same methods. Our focus is on the compu- 
tational procedures here. In the present case the observation function h(x„) 
simply returns the the voltage component of x„, but in general it could be any 
function of x„. 

In the example calculation, one million path update iterations were per- 
formed, with Nskip = 400. The first half million were excluded from the 
statistics {Ninn = 500,000). Simulated annealing was done over the first 
Ncooi = 100,000 iterations, with /3o = 0.01, and /3 was incremented by mul- 
tiplying by fp (Eq. [TU)) after each iteration. The target acceptance rate was 
face — 0.23 and the step size adjustment rate was a — 0.02. The initial settings 
for Monte Carlo step sizes were = 2 x lO^'^ for voltage and A,; = lO^'^ for 
the n, m, and h state variables. 

The number of data points used was M = 8000, or 320 ms of data. The 
matrix Rd was set to be [Rd]v,y = 100, [Rd]a,a = 10^ (a e {n,m,h}), and aU 
other components zero. Since the observation is one-dimensional Ro is a scalar, 
and was set to 100. Figure |4] shows the evolution of one parameter, pi, to give 
an idea of the equilibration process. The mean and standard deviation of each 
component of the path, that is each state at each time step and each parameter, 
was calculated. The mean and the standard deviation of the unobserved states 
shown in Fig. [51 The estimated parameters values are given in Table [5] This 
calculation took 683 seconds to run for one million iterations on an NVIDIA 
GTX 460 GPU which has 224 cores. 

For timing purposes, the same calculation was done over a range of time 
series lengths from M — 100 up to M = 40,000, but with 1/lOth as many 
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Table 1: Constants used in the kinetic equations. 



Name 


Va [mV] 


AK [mV] 


Tao [ms] 


Tai [ms] 


n, K Activation 


10 


30 


1.0 


5.0 


TO, Na Activation 


25 


15 


0.1 


0.4 


h, Na Inactivation 


5 


-15 


1.0 


7.0 



Table 2: Estimated parameters. 



Name 


Mean 


St. Dev. 


Actual 


Units 


Pi 


1.02 


0.04 


1 


1/pF 


P2 


127 


2 


120 


1/ms 


P3 


115 


0.1 


115 


mV 


Pi 


20.3 


0.3 


20 


1/ms 


P5 


-12.0 


0.08 


-12 


mV 


P6 


0.33 


0.03 


0.3 


1/ms 


P7 


9.3 


1.0 


10.6 


mV 



1.3 



1.2 - : 
i 




500 1000 1500 2000 2500 



Sample Number 



Figure 4: The evolution of one of the parameters as the MC path update procedure runs. 
The first 1250 samples where discarded. Simulated annealing was applied during the first 250 
iterations. Eventually the parameter settles down and fluctuates around a mean of 1.02. 
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Figure 5: The output of the PIMC process. These are the three unobserved states. The black 
hne is the true value (which is known because the data was simulated). The solid region is 
centered on the mean and extends to plus or minus 4 standard deviations from the mean. 

iterations, and the execution time, Tqpjj, was recorded. The same calculation 
was done on two different NVIDIA CPUs, the GTX 460 with 224 cores and 
the GTX 470 with 448 cores. In all cases, the number of threads per block 
was set to 100. For comparison a similar, but not exactly the same, calculation 
was done on a single core of an Intel Core 13 CPU. A linear time scaling of 
Tcpu = M X (1.24 s) was fit from several trials on the CPU. In Fig. [6] the 
parallel speedup factor, Tcpu/Tgpu, is displayed. 

6. Discussion 

Running the PIMC in parallel decreased the computation time by a factor of 
up to 200 or 300, depending on which GPU was used. As the time series becomes 
longer, more of the GPU cores are simultaneously utilized. The specifics of 
how the speedup factor varies as a function of time series length depends on 
the particular GPU, and on how CUDA schedules the threads to utilize the 
available resources. 

One way to assess the quality of the model is to use the model to integrate 
the final state at time t — T forward in time and compare to additional observa- 
tions. One possibility is to use the mean final state {xt) and mean value of the 
parameters, ipi), to generate a single trajectory, which can then be compared to 
the observed time series. However, this may not be adequate when the dynam- 
ics is noisy, or when the model is chaotic. Another approach is to generate an 
ensemble forecast, with the appropriate amount of noise added into the model 
integration. Many noisy model integrations should be done, by using many 
different final states and parameter values as generated by the Monte Carlo 
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Time Series Lengtli, ^ 



Figure 6: The parallel speedup factor as a function of time series length M, using two different 
NVIDIA GPUs. The parallel speedup factor is defined as Tcpu /Tgpu ■ The performance 
of the two GPUs is about the same for small M , where they are underutilized, but as M 
increases, the GTX 470 performs better because it has more cores. 



procedure, and many different noise sequences. Then the ensemble forecast can 
be compared to the single observed trajectory in a statistical way. 

Another way to asses the quality of the model is to look at the model errors, 
e„, and the observation errors, These will have specific values for each 
Monte Carlo path generated, since they are functions of the path, X. If the 
model is correct, then the statistics of these quantities should be consistent 
with our assumptions about the noise. In the example problem the assumptions 
are that both noises are Gaussian and white. Therefore the means should be 
zero, the variances should come out as specified by the Ro and Rd matrices, 
and there should be no correlation in time or among the different components. 
In the example problem presented here, there is a source of model error which 
comes from the discretization of the model differential equations. This appears 
as deviations of (e„) from zero at the times when V{t) spikes. 

The method presented in this paper is quite general, and can be applied 
to any dynamical system. All that is needed is a time series of observations, 
a Markov model of the dynamics, and some assumptions about the noise in 
the dynamics and the noise in the observations. By utilizing GPU technology, 
it becomes practical to run the method on realistic problems using desktop 
computers. 
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